Unsupervised Feature Selection by Means of External Validity Indices

Author

  • Javier Béjar
Abstract

Feature selection for unsupervised data is a difficult task because no reference partition is available to evaluate the relevance of the features. Recently, several proposals for consensus clustering have used external validity indices to assess the agreement among partitions obtained by clustering algorithms with different parameter values. These indices are independent of the characteristics of the attributes describing the data, the way the partitions are represented, and the shape of the clusters. This independence makes it possible to use these measures to assess the similarity of partitions obtained with different subsets of attributes. As in supervised feature selection, the goal of unsupervised feature selection is to preserve the patterns of the original data using less information. The hypothesis of this paper is that the clustering of the dataset with all the attributes, even when its quality is not perfect, can be used as the basis for a heuristic exploration of the space of feature subsets. The proposal is to use external validation indices as the measure of how well this information is preserved by a subset of the original attributes. Different external validation indices have been proposed in the literature. This paper will present experiments using the adjusted Rand, Jaccard and Fowlkes-Mallows indices. Artificially generated datasets will be used to test the methodology under different experimental conditions, such as the number of clusters, the spatial separation of the clusters and the ratio of irrelevant features. The methodology will also be applied to real datasets chosen from the UCI machine learning repository.
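To make the comparison of partitions concrete, the following is a minimal sketch of the three indices named in the abstract, computed by pair counting between a reference partition (for example, the clustering obtained with all attributes) and a candidate partition (for example, the clustering obtained with a subset of attributes). The function names and the toy label lists are assumptions made for illustration; this is not the paper's implementation, and the clustering step itself is left out.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_a, labels_b):
    """Count object pairs by co-membership in two partitions.

    Returns (n11, n10, n01, n00):
      n11: pairs together in both partitions
      n10: pairs together in A only
      n01: pairs together in B only
      n00: pairs separated in both partitions
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # contingency table cells
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    n11 = sum(comb(c, 2) for c in joint.values())
    n10 = same_a - n11
    n01 = same_b - n11
    n00 = comb(n, 2) - n11 - n10 - n01
    return n11, n10, n01, n00


def jaccard_index(labels_a, labels_b):
    n11, n10, n01, _ = pair_counts(labels_a, labels_b)
    denom = n11 + n10 + n01
    return n11 / denom if denom else 1.0


def fowlkes_mallows_index(labels_a, labels_b):
    n11, n10, n01, _ = pair_counts(labels_a, labels_b)
    denom = sqrt((n11 + n10) * (n11 + n01))
    return n11 / denom if denom else 0.0


def adjusted_rand_index(labels_a, labels_b):
    n11, n10, n01, n00 = pair_counts(labels_a, labels_b)
    total = n11 + n10 + n01 + n00
    same_a, same_b = n11 + n10, n11 + n01
    expected = same_a * same_b / total if total else 0.0
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (n11 - expected) / (max_index - expected)


# Toy example (hypothetical labels): reference from all attributes,
# candidate from some subset of attributes.
reference = [0, 0, 0, 1, 1, 1]
candidate = [0, 0, 1, 1, 1, 1]
scores = {
    "adjusted_rand": adjusted_rand_index(reference, candidate),
    "jaccard": jaccard_index(reference, candidate),
    "fowlkes_mallows": fowlkes_mallows_index(reference, candidate),
}
```

Because the indices only look at which pairs of objects share a cluster, they are unaffected by how cluster labels are named, which is the independence property the abstract relies on when comparing partitions built from different attribute subsets.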

Similar resources

Unsupervised Variable Selection: when random rankings sound as irrelevancy

Whereas variable selection has been extensively studied in the context of supervised learning, unsupervised variable selection has attracted the attention of researchers more recently, as the amount of available unlabeled data has exploded. Many unsupervised variable ranking criteria were proposed and their relevance is usually demonstrated using either external cluster validity indexes or t...

Canonical PSO Based K-Means Clustering Approach for Real Datasets

Clustering is a technique whose significance and applications are spread over various fields. Clustering is an unsupervised process in data mining, which is why the proper evaluation of the results and the measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different ...

Validation of unsupervised clustering methods for leaf phenotype screening

The assessment of visible differences in leaf shape between plant species or mutants (phenotyping) plays a significant role in plant research. This paper investigates the application of unsupervised data clustering techniques for phenotype screening to find hidden common shape categories. A set of two wildtypes and seven mutations of Arabidopsis acted as a test case. K-Means, NG, GNG, SOM and A...

Development of An External Cluster Validity Index using Probabilistic Approach and Min-max Distance

Validating a given clustering result is a very challenging task in the real world. For this purpose, several cluster validity indices have been developed in the literature. Cluster validity indices are divided into two main categories: external and internal. External cluster validity indices rely on some available supervised information, and internal validity indices utilize the intrinsic structu...

Rough-Fuzzy Clustering and Unsupervised Feature Selection for Wavelet Based MR Image Segmentation

Image segmentation is an indispensable process in the visualization of human tissues, particularly during clinical analysis of brain magnetic resonance (MR) images. For many human experts, manual segmentation is a difficult and time consuming task, which makes an automated brain MR image segmentation method desirable. In this regard, this paper presents a new segmentation method for brain MR im...

Publication date: 2013